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Abstract 


We introduce a new nonlinear model for classification, in which we model the joint distribution of 
response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We 
keep the relationship between y and x linear within each component of the mixture. The overall 
relationship becomes nonlinear if the mixture contains more than one component, with different 
regression coefficients. We use simulated data to compare the performance of this new approach 
to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector 
machines. We also evaluate our approach on two classification problems: identifying the folding 
class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve 
predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture com- 
ponents), our model can sometimes provide insight into hidden structure in the data. 


Keywords: mixture models, Dirichlet process, classification 


1. Introduction 


In regression and classification models, estimation of parameters and interpretation of results are 
easier if we assume that distributions have simple forms (e.g., normal) and that the relationship 
between a response variable and covariates is linear. However, the performance of such a model 
depends on the appropriateness of these assumptions. Poor performance may result from assum- 
ing wrong distributions, or regarding relationships as linear when they are not. In this paper, we 
introduce a new model based on a Dirichlet process mixture of simple distributions, which is more 
flexible in capturing nonlinear relationships. 

A Dirichlet process, D(Go,Y), with baseline distribution Go and scale parameter y, is a dis- 
tribution over distributions. Ferguson (1973) introduced the Dirichlet process as a class of prior 
distributions for which the support is large, and the posterior distribution is manageable analyti- 
cally. Using the Polya urn scheme, Blackwell and MacQueen (1973) showed that the distributions 
sampled from a Dirichlet process are discrete almost surely. 

The idea of using a Dirichlet process as the prior for the mixing proportions of a simple dis- 
tribution (e.g., Gaussian) was first introduced by Antoniak (1974). In this paper, we will describe 
the Dirichlet process mixture model as a limit of a finite mixture model (see Neal, 2000, for further 
description). Suppose exchangeable random values y,...,y, are drawn independently from some 
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unknown distribution. We can model the distribution of y as a mixture of simple distributions, with 
probability or density function 


Mo 
N 
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PO) = F, 9c). 
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Here, pc are the mixing proportions, and F'(y,) is the probability or density for y under a distribu- 
tion, F (b), in some simple class with parameters ¢—for example, a normal in which 6 = (u, o). We 
first assume that the number of mixing components, C, is finite. In this case, a common prior for pe 
is a symmetric Dirichlet distribution, with density function 


/C)— 
P(p1,---;Pc) = % 
Ta T(y/c)° Io! 


where pe > 0 and Y pe = 1. The parameters 6, are independent under the prior, with distribution 
Go. We can use mixture identifiers, c;, and represent this model as follows: 


yilci,0 ~ Fh), 
cilpi;,- pc ~ Discrete(pi,..., pc), 
Pi,- pc ~ Dirichlet(y/C,....,Y/C), 
Q% ~ Go. 


(1) 


By integrating over the Dirichlet prior, we can eliminate the mixing proportions, pe, and obtain the 
following conditional distribution for c;: 


Nic +y/C 


le a = 


P(c; = c|c1, <, Ci—1) 
Here, nic represents the number of data points previously (i.e., before the i”) assigned to component 
c. As we can see, the above probability becomes higher as nic increases. 
When C goes to infinity, the conditional probabilities (2) reach the following limits: 











n; 
Pla = cheis) => = 

P(ci #c; for all j < ilcy,...,¢j-1) = i Y 
p= -4 


As a result, the conditional probability for 6;, where 0; = þc;, becomes 


0:01, ...,0;—1 Go, (3) 





ys 9; rs 
i—1+y4 fei = +Y 
where 5g is a point mass distribution at 6. Since the observations are assumed to be exchangeable, 
we can regard any observation, i, as the last observation and write the conditional probability of 0; 
given the other 0; for j Æ i (written @_;) as follows: 


1 Y 
de, 4 Go. 4 
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The above conditional probabilities are equivalent to the conditional probabilities for 8; according 
to the Dirichlet process mixture model (as presented by Blackwell and MacQueen, 1973, using the 
Polya urn scheme), which has the following form: 


yilð; ~ F(8), 
ale ~ G, (5) 
G ~ D(Go,Y). 


That is, If we let 0; = ,,, the limit of model (1) as C — œ becomes equivalent to the Dirichlet 
process mixture model (Ferguson, 1983; Neal, 2000). In model (5), G is the distribution over 0’s, 
and has a Dirichlet process prior, D. Phrased this way, each data point, i, has its own parameters, 
0;, drawn independently from a distribution that is drawn from a Dirichlet process prior. But since 
distributions drawn from a Dirichlet process are discrete (almost surely) as shown by Blackwell and 
MacQueen (1973), the 0; for different data points may be the same. 

The parameters of the Dirichlet process prior are Go, a distribution from which 8’s are sampled, 
and y, a positive scale parameter that controls the number of components of the mixture that will 
be represented in the sample, such that a larger y results in a larger number of components. To 
illustrate the effect of y on the number of mixture components in a sample of size 200, we generated 
samples from four different Dirichlet process priors with y = 0.1,1,5,10, and the same baseline 
distribution Go = N2(0, 10/2) (where J is a 2 x 2 identity matrix). For a given value of y, we first 
sample 8;, where i = 1,...,200, according to the conditional probabilities (3), and then we sample 
y;|0; ~ N2(8;,0.21). The data generated according to these priors are shown in Figure 1. As we can 
see, the (prior) expected number of components in a finite sample increases as y becomes larger. 

With a Dirichlet process prior, we can we find conditional distributions of the posterior distribu- 
tion of model parameters by combining the conditional prior probability of (4) with the likelihood 
F (y;,9;), obtaining 


9;|8_i,y; ~ $ qijõo, + rij, (6) 
j#i 


where H; is the posterior distribution of O based on the prior Go and the single data point y;, and the 
values of the q;; and r; are defined as follows: 


qij = bF(:;,8;), 
by i F (y;,0)dGo(0). 


ri 


Here, b is such that r; +} ;4;4i; = 1. These conditional posterior distributions are what are needed 
when sampling the posterior using MCMC methods, as discussed further in Section 2. 

Bush and MacEachern (1996), Escobar and West (1995), MacEachern and Müller (1998), and 
Neal (2000) have used Dirichlet process mixture models for density estimation. Müller et al. (1996) 
used this method for curve fitting. They model the joint distribution of data pairs (x;,y;) as a Dirich- 
let process mixture of multivariate normals. The conditional distribution, P(y|x), and the expected 
value, E(y|x), are estimated based on this distribution for a grid of x’s (with interpolation) to obtain 
a nonparametric curve. The application of this approach (as presented by Miiller et al., 1996) is 
restricted to continuous variables. Moreover, this model is feasible only for problems with a small 
number of covariates, p. For data with moderate to large dimensionality, estimation of the joint 
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Figure 1: Data sets of size n = 200 generated according to four different Dirichlet process mix- 
ture priors, each with the same baseline distribution, Go = M2 (0, 10%), but different scale 
parameters, y. As Y increases, the expected number of components present in the sam- 
ple becomes larger. (Note that, as can be seen above, when y is large, many of these 
components have substantial overlap.) 


distribution is very difficult both statistically and computationally. This is mostly due to the diffi- 
culties that arise when simulating from the posterior distribution of large full covariance matrices. 
In this approach, if a mixture model has C components, the set of full covariance matrices have 
Cp(p+1)/2 parameters. For large p, the computational burden of estimating these parameters 
might be overwhelming. Estimating full covariance matrices can also cause statistical difficulties 
since we need to assure that covariance matrices are positive semidefinite. Conjugate priors based 
the inverse Wishart distribution satisfy this requirement, but they lack flexibility (Daniels and Kass, 
1999). Flat priors may not be suitable either, since they can lead to improper posterior distributions, 
and they can be unintentionally informative (Daniels and Kass, 1999). A common approach to ad- 
dress these issues is to use decomposition methods in specifying priors for full covariance matrices 
(see for example, Daniels and Kass, 1999; Cai and Dunson, 2006). Although this approach has 
demonstrated some computational advantages over direct estimation of full covariance matrices, it 
is not yet feasible for high-dimensional variables. For example, Cai and Dunson (2006) recommend 
their approach only for problems with less than 20 covariates. 
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We introduce a new nonlinear Bayesian model, which also nonparametrically estimates P(x,y), 
the joint distribution of the response variable y and covariates x, using Dirichlet process mixtures. 
Within each component, we assume the covariates are independent, and model the dependence 
between y and x using a linear model. Therefore, unlike the method of Miiller et al. (1996), our 
approach can be used for modeling data with a large number of covariates, since the covariance 
matrix for one mixture component is highly restricted. Using the Dirichlet process as the prior, our 
method has a built-in mechanism to avoid overfitting since the complexity of the nonlinear model 
is controlled. Moreover, this method can be used for categorical as well as continuous response 
variables by using a generalized linear model instead of the linear model. 

The idea of building a nonlinear model based on an ensemble of simple linear models has been 
explored extensively in the field of machine learning. Jacobs et al. (1991) introduced a supervised 
learning procedure for models that are comprised of several local models (experts) each handling a 
subset of data. A gating network decides which expert should be used for a given data point. For 
inferring the parameters of such models, Waterhouse et al. (1996) provided a Bayesian framework to 
avoid over-fitting and noise level under-estimation problems associated with traditional maximum 
likelihood inference. Rasmussen and Ghahramani (2002) generalized mixture of experts models by 
using infinitely many nonlinear experts. In their approach, each expert is assumed to be a Gaussian 
process regression model, and the gating network is based on an input-dependent adaptation of 
Dirichlet process. Meeds and Osindero (2006) followed the same idea, but instead of assuming that 
the covariates are fixed, they proposed a joint mixture of experts model over covariates and response 
variable. 

Our focus here is on classification models with a multi-category response, in which we have 
observed data for n cases, (x1, 1),.--,(Xn, Yn). Here, the class y; has J possible values, and the covari- 
ates x; can in general be a vector of p covariates. We wish to classify future cases in which only the 
covariates are observed. For binary (J = 2) classification problems, a simple logistic model can be 
used, with class probabilities defined as follows (with the case subscript dropped from x and y): 


exp(a+x"B) 
1 +exp(a+x7B) 





P(y=1|x,0,B) = 


When there are three or more classes, we can use a generalization known as the multinomial logit 
(MNL) model (called “softmax” in the machine learning literature): 


exp(a; +x"B;) 


P(y = j|x,Q, = ; 
= Jb OB) T e 





MNL models are discriminative, since they model the conditional distribution P(y|x), but not 
the distribution of covariates, P(x). In contrast, our dpMNL model is generative, since it estimates 
the joint distribution of response and covariates, P(x,y). The joint distribution can be decompose 
into the product of the marginal distribution P(x) and the conditional distribution P(y|x); that is, 
P(x,y) = P) Pox). 

Generative models have several advantages over discriminative models (see for example, Ulusoy 
and Bishop, 2005). They provide a natural framework for handling missing data or partially labeled 
data. They can also augment small quantities of expensive labeled data with large quantities of 
cheap unlabeled data. This is especially useful in applications such as document labeling and image 
analysis, where it may provide better predictions for new feature patterns not present in the data at 
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Figure 2: An illustration of our model for a binary (black and white) classification problem with 
two covariates. Here, the mixture has two components, which are shown with circles 
and squares. In each component, an MNL model separates the two classes into “black” 
or “white” with a linear decision boundary. The overall decision boundary, which is a 
smooth function, is not shown in this figure. 


the time of training. For example, the Latent Dirichlet Allocation (LDA) model proposed by Blei 
et al. (2003) is a well-defined generative model that performs well in classifying documents with 
previously unknown patterns. 

While generative models are quite successful in many problems, they can be computationally in- 
tensive. Moreover, finding a good (but not perfect) estimate for the joint distribution of all variables 
(i.e., x and y) does not in general guarantee a good estimate of decision boundaries. By contrast, 
discriminative models are often computationally fast and are preferred when the covariates are in 
fact non-random (e.g., they are fixed by an experimental design). 

Using a generative model in our proposed method provides an additional benefit. Modeling the 
distribution of covariates jointly with y allows us to implicitly model the dependency of covariates 
on each other through clustering (i.e., assigning data points to different components), which could 
provide insight into hidden structure in the data. To illustrate this concept, consider Figure 2 where 
the objective is to classify cases into black or white. To improve predictive accuracy, our model 
has divided the data into two components, shown as squares and circles. These components are 
distinguished primarily by the value of the second covariate, x2, which is usually positive for squares 
and negative for circles. For cases in the squares group, the response variable strongly depends on 
both x; and x2 (the linear separator is almost diagonal), whereas, for cases in the circles group, the 
model mainly depends on x, alone (the linear model is almost vertical). Therefore, by grouping the 
data into sub-populations (e.g., circles and squares in this example), our model not only improves 
classification accuracy, but also discovers hidden structure in the data (i.e., by clustering covariate 
observations). This concept is briefly discussed in Section 5, where we use our model to predict 
Parkinson’s disease. A more detailed discussion on using our method to detect hidden structure in 
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data is provided elsewhere (Shahbaba, 2009), where a Dirichlet process mixture of autoregressive 
models us used to analyze time-series processes that are subject to regime changes, with no specific 
economic theory about the structure of the model. 

The next section describes our methodology. In Section 3, we illustrate our approach and evalu- 
ate its performance on simulated data. In Section 4, we present the results of applying our model to 
an actual classification problem, which attempts to identify the folding class of a protein sequence 
based on the composition of its amino acids. Section 5 discusses another real classification prob- 
lem, where the objective is to detect Parkinson’s disease. This example is provided to show how 
our method can be used not only for improving prediction accuracy, but also for identifying hidden 
structure in the data. Finally, Section 6 is devoted to discussion and future directions. 


2. Methodology 


We now describe our classification model, which we call dpMNL, in detail. We assume that for each 
case we observe a vector of continuous covariates, x, of dimension p. The response variable, y, is 
categorical, with J classes. To model the relationship between y and x, we non-parametrically model 
the joint distribution of y and x, in the form P(x,y) = P(x)P(y|x), using a Dirichlet process mixture. 
Within each component of the mixture, the relationship between y and x (i.e., P(y|x)) is expressed 
using a linear function. The overall relationship becomes nonlinear if the mixture contains more 
than one component. This way, while we relax the assumption of linearity, the flexibility of the 
relationship is controlled. 

Each component in the mixture model has parameters 8 = (u,6*,a@,f). The distribution of 
x within a component is multivariate normal, with mean vector u and diagonal covariance, with 
the vector o? on the diagonal. The distribution of y given x within a component is given by a 
multinomial logit (MNL) model—for j = 1,...,J, 


exp(j +x" B;) 


PY= j|x,Q, : 
O=J p) Eri explay +x"B 7) 





The parameter 0; is scalar, and B; is a vector of length p. Note that given x, the distribution of y does 
not depend on u and ©. This representation of the MNL model is redundant, since one of the B;’s 
(where j = 1,...,/) can be set to zero without changing the set of relationships expressible with the 
model, but removing this redundancy would make it difficult to specify a prior that treats all classes 
symmetrically. In this parameterization, what matters is the difference between the parameters of 
different classes. 

In addition to the mixture view, which follows Equation (1) with C — ©, one can also view each 
observation, i, as having its own parameter, 0;, drawn independently from a distribution drawn from 
a Dirichlet process, as in Equation (5): 


O/G ~ G, fori=1,...,n, 
G ~ D(Go,Y). 
Since G will be discrete, many groups of observations will have identical 0;, corresponding to 
components in the mixture view. 


Although the covariates in each component are assumed to be independent with normal priors, 
this independence of covariates exists only locally (within a component). Their global (over all 
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components) dependency is modeled by assigning data to different components (i.e., clustering). 
The relationship between y and x within a component is captured using an MNL model. Therefore, 
the relationship is linear locally, but nonlinear globally. 

We could assume that y and x are independent within components, and capture the dependence 
between the response and the covariates by clustering too. However, this may lead to poor per- 
formance (e.g., when predicting the response for new observations) if the dependence of y on x is 
difficult to capture using clustering alone. Alternatively, we could also assume that the covariates 
are dependent within a component. For continuous response variables, this becomes equivalent to 
the model proposed by Miiller et al. (1996). If the covariates are in fact dependent, using full co- 
variance matrices (as suggested by Müller et al., 1996) could result in a more parsimonious model 
since the number of mixture component would be smaller. However, as we discussed above, this 
approach may be practically infeasible for problems with a moderate to large number of covariates. 
We believe that our method is an appropriate compromise between these two alternatives. 

We define Go, which is a distribution over 6 = (u,67, 0, B), as follows: 


uiluo,So ~ N(u0,99), 
log(07)|Mo;,Vo ~ N(Mo,Vo), 
aj ~ N(0,7°), 
Baly ~ N(0,v°). 


The parameters of Go may in turn depend on higher level hyperparameters. For example, we can 
regard the variances of coefficients as hyperparameters with the following priors: 


log(t”)|Mz, Vz aS N(M:, V2), 
log(v?)|My, W} ~ N(My,V2). 


We use MCMC algorithms for posterior sampling. We could use Gibbs sampling if Go is the 
conjugate prior for the likelihood given by F. That is, we would repeatedly draw samples from 
6;|8_i,y; (where i = 1,...,n) using the conditional distribution (6). Neal (2000) presented several 
algorithms for sampling from the posterior distribution of Dirichlet process mixtures when non- 
conjugate priors are used. Throughout this paper, we use Gibbs sampling with auxiliary parameters 
(Neal’s algorithm 8). 

This algorithm uses a Markov chain whose state consists of c1, ..., Cn and 0 = (Qe : € € {c1,...,Cn}), 
so that 8; = ,,. In order to allow creation of new clusters, the algorithm temporarily supplements 
the , parameters of existing clusters with m (or m — 1) additional parameter values drawn from the 
prior, where m a postive integer that can be adjusted to give good performance. Each iteration of 
the Markov chain simulation operates as follows: 


e Fori=1,...,n: Let k` be the number of distinct c; for j 4 i and let h =k” +m. Label these 
cj with values in {1,...,k~ }. If c; = cj for some j Æ i, draw values independently from Go for 
those p- for which k~ < c < h. If c; # cj for all j Fi, let c; have the label k- +1, and draw 
values independently from Go for those þe where k7 +1<c<h. Draw a new value for c; 
from {1,...,4} using the following probabilities: 


HN 
en | On) forl<c<k, 
P(c; = ¢|c_i,¥i,91, ---, On) y/m 
n1 Virb) fork” <c<h, 
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where n_i c is the number of c; for j Æ i that are equal to c, and b is the appropriate normalizing 
constant. Change the state to contain only those @, that are now associated with at least to 
one observation. 


e For all c € {c1,...,¢n} draw a new value from the distribution e | {y; such that c; = c}, or 
perform some update that leaves this distribution invariant. 


Throughout this paper, we set m = 5. This algorithm resembles one proposed by MacEachern and 
Miiller (1998), with the difference that the auxiliary parameters exist only temporarily, which avoids 
an inefficiency in MacEachern and Miiller’s algorithm. 

Samples simulated from the posterior distribution are used to estimate posterior predictive prob- 
abilities. For a new case with covariates x’, the posterior predictive probability of the response 
variable, y’, is estimated as follows: 


PO = jix’) = Sa 


where 


P(y' = j,x'|Go,0), 


v 
— 
< 

I 
= 

s 

ll 

Ale 
M- 


a 
ll 
= 


P(x'|Go, 0). 


y 
“~~ 
s 
I 
Ale 
M- 


Il 
iar 


Here, S is the number of post-convergence samples from MCMC, and gs) represents the set of 
parameters obtained at iteration s. Alternatively, we could predict new cases using P(y’ = j,x’) = 
IES P(y' = j\x’,Go,0)). While this would be computationally faster, the above approach allows 
us to learn from the covariates of test cases when predicting their response values. Note also that 
the above predictive probabilities include the possibility that the test case is from a new cluster. 

We use these posterior predictive probabilities to make predictions for test cases, by assigning 
each test case to the class with the highest posterior predictive probability. This is the optimal strat- 
egy for a simple 0/1 loss function. In general, we could use more problem-specific loss functions 
and modify our prediction strategy accordingly. 

Implementations for all our models were coded in MATLAB, and are available online at http: 
//www.ics.uci.edu/~babaks/codes. 


3. Results for Simulated Data 


In this section, we illustrate our dpMNL model using synthetic data, and compare it with other 
models as follows: 


e A simple MNL model, fitted by maximum likelihood or Bayesian methods. 


e A Bayesian MNL model with quadratic terms (i.e., xjx,, where / = 1,...,p and k = 1,..., p), 
referred to as qMNL. 


e A decision tree model (Breiman et al., 1993) that uses 10-fold cross-validation for pruning, 
as implemented by the MATLAB “treefit”, “treetest” (for cross-validation) and “treeprune” 
functions. 
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e Support Vector Machines (SVMs) (Vapnik, 1995), implemented with the MATLAB “svm- 
train” and “svmclassify” functions from the Bioinformatics toolbox. Both a linear SVM 
(LSVM) and a nonlinear SVM with radial basis function kernel (RBF-S VM) were tried. 


When the number of classes in a classification problem was bigger than two, LSVM and RBF-SVM 
used the all-vs-all scheme as suggested by Allwein et al. (2000), Fiirnkranz (2002), and Hsu and 
Lin (2002). In this scheme, (5) binary classifiers are trained where each classifier separates a pair 
of classes. The predicted class for each test case is decided by using a majority voting scheme 
where the class with the highest number of votes among all binary classes wins. For RBF-SVM, 
the scaling parameter, A, in the RBF kernel, k(x,x’) = exp(—||x —x’||/2A), was optimized based on 
a validation set comprised of 20% of training samples. 

The models are compared with respect to their accuracy rate and the F} measure. Accuracy rate 
is defined as the percentage of the times the correct class is predicted. F; is a common measurement 
in machine learning defined as: 


1 2A; 
MOS PAT B C: 
ja tB; +C] 





where A; is the number of cases which are correctly assigned to class j, B; is the number cases 
incorrectly assigned to class j, and C; is the number of cases which belong to the class j but are 
assigned to other classes. 

We do two tests. In the first test, we generate data according to the dpMNL model. Our objective 
is to evaluate the performance of our model when the distribution of data is comprised of multiple 
components. In the second test, we generate data using a smooth nonlinear function. Our goal is to 
evaluate the robustness of our model when data actually come from a different model. 


3.1 Simulation 1 


The first test was on a synthetic four-way classification problem with five covariates. Data are 
generated according to our d(pMNL model, except the number of components was fixed at two. Two 
hyperparameters defining Go were given the following priors: 


log(t”) ~ N(0,0.17), 
log(v?) ~ N(0,2?). 


The prior for component parameters 6 = (u,07,0,B) defined by this Go was 


mw ~ N(0,1 

log(o?) ~ N(0,2 
at ~ N(0 
Baly ~ N(O 

where / = 1,...,5 and j = 1,...,4. We randomly draw parameters 6; and 82 for two components as 


described from this prior. For each component, we then generate 5000 data points by first drawing 
xi ~ N (u1,©1) and then sampling y using the following MNL model: 


exp(a; +xB ;) 


P(y= j|x,,B) = Ep expla +.B jr) 
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Model Accuracy (%) F, (%) 

Baseline 45.57 (1.47) | 15.48 (1.77) 
MNL (Maximum Likelihood) | 77.30 (1.23) | 66.65 (1.41) 
MNL 78.39 (1.32) | 66.52 (1.72) 
qMNL 83.60 (0.99) | 74.16 (1.30) 
Tree (Cross Validation) 70.87 (1.40) | 55.82 (1.69) 
LSVM 78.61 (1.17) | 67.03 (1.51) 
RBF-SVM 79.09 (0.99) | 63.65 (1.44) 
dpMNL 89.21 (0.65) | 81.00 (1.23) 











Table 1: Simulation 1: the average performance of models based on 50 simulated data sets. The 
Baseline model assigns test cases to the class with the highest frequency in the training 
set. Standard errors of estimates (based on 50 repetitions) are provided in parentheses. 


The overall sample size is 10000. We randomly split the data into a training set, with 100 data 
points, and a test set, with 9900 data points. We use the training set to fit the models, and use the 
independent test set to evaluate their performance. The regression parameters of the Bayesian MNL 
model with Bayesian estimation and the qMNL model have the following priors: 


alt ~ N(0,7°), 
Bly ~ N(0,v*), 
log(t) ~ N(0,1°), 
log(v) ~ N(0,2°). 


The above procedure was repeated 50 times. Each time, new hyperparameters, t” and v?, and 
new component parameters, 8; and 02, were sampled, and a new data set was created based on these 
0’s. 

We used Hamiltonian dynamics (Neal, 1993) for updating the regression parameters (the a’s 
and $’s). For all other parameters, we used single-variable slice sampling (Neal, 2003) with the 
“stepping out” procedure to find an interval around the current point, and then the “shrinkage” 
procedure to sample from this interval. We also used slice sampling for updating the concentration 
parameter y, We used log(y) ~ N(—3,27) as the prior, which, encourages smaller values of y, and 
hence a smaller number of components. Note that the likelihood for y depends only on C, the 
number of unique components (Neal, 2000; Escobar and West, 1995). For all models we ran 5000 
MCMC iterations to sample from the posterior distributions. We discarded the initial 500 samples 
and used the rest for prediction. 

Our dpMNL model has the highest computational cost compared to all other methods. Simulat- 
ing the Markov chain took about 0.15 seconds per iteration using a MATLAB implementation on an 
UltraSPARC III machine (approximately 12.5 minutes for each simulated data set). Each MCMC 
iteration for the Bayesian MNL model took about 0.1 second (approximately 8 minutes for each 
data set). Training the RBF-SVM model (with optimization of the scale parameter) took approxi- 
mately 1 second for each data set. Therefore, SVM models have a substantial advantage over our 
approach in terms of computational cost. 
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Figure 3: A random sample generated according to Simulation 2, with a3 = 0. The dotted line is 
the optimal boundary function. 


The average results (over 50 repetitions) are presented in Table 1. As we can see, our d(pMNL 
model provides better results compared to all other models. The improvements are statistically 
significant (p-values < 0.001) for comparisons of accuracy rates using a paired f-test with n = 50. 


3.2 Simulation 2 


In the above simulation, since the data were generated according to the dpMNL model, it is not 
surprising that this model had the best performance compared to other models. In fact, as we 
increase the number of components, the amount of improvement using our model becomes more and 
more substantial (results not shown). To evaluate the robustness of the dp MNL model, we performed 
another test. This time, we generated xj1,xj2,xj3 (where i = 1,..., 10000) from the Uniform(0,5) 
distribution, and generated a binary response variable, y;, according the following model: 


1 


Py=lilx) = ; 
O I) 1+exp|a, sin(x} 4 + 1.2) +x; cos(azx2 +0.7) + a3x3 — 2] 





where aj, a2 and a3 are randomly sampled from N(1,0.57). The function used to generate y is a 
smooth nonlinear function of covariates. The covariates are not clustered, so the generated data 
do not conform with the assumptions of our model. Moreover, this function includes a completely 
arbitrary set of constants to ensure the results are generalizable. Figure 3 shows a random sample 
from this model except that a3 is fixed at zero (so x3 is ignored). In this figure, the dotted line is the 
optimal decision boundary. 

Table 2 shows the results for this simulation, which are averages over 50 data sets. For each data 
set, we generated 10000 cases by sampling new values for a1, a2, and a3, new covariates, x, for each 
case, and new values for the response variable, y, in each case. As before, models were trained on 
100 cases, and tested on the remaining 9900. As before, the dpMNL model provides significantly 
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Model Accuracy (%) F, (%) 

Baseline 61.96 (1.53) | 37.99 (0.57) 
MNL (Maximum Likelihood) | 73.58 (0.96) | 68.33 (1.17) 
MNL 73.58 (0.97) | 67.92 (1.41) 
qMNL 75.60 (0.98) | 70.12 (1.36) 
Tree (Cross Validation) 73.47 (0.95) | 66.94 (1.43) 
LSVM Linear 73.09 (0.99) | 64.95 (1.71) 
RBF-SVM 76.06 (0.94) | 68.46 (1.77) 
dpMNL 77.80 (0.86) | 73.13 (1.26) 











Table 2: Simulation 2: the average performance of models based on 50 simulated data sets. The 
Baseline model assigns test cases to the class with the highest frequency in the training 
set. Standard errors of estimates (based on 50 repetitions) are provided in parentheses. 


(all p-values are smaller than 0.001) better performance compared to all other models. This time, 
however, the performances of qMNL and RBF-SVM are closer to the performance of the dpMNL 
model. 


4. Results on Real Classification Problems 


In this section, we first apply our model the problem of predicting a protein’s 3D structure (i.e., 
folding class) based on its sequence. We then use our model to identify patients with Parkinson’s 
disease (PD) based on their speech signals. 


4.1 Protein Fold Classification 


When predicting a protein’s 3D structure, it is common to presume that the number of possible 
folds is fixed, and use a classification model to assign a protein to one of these folding classes. 
There are more than 600 folding patterns identified in the SCOP (Structural Classification of Pro- 
teins) database (Lo Conte et al., 2000). In this database, proteins are considered to have the same 
folding class if they have the same major secondary structure in the same arrangement with the same 
topological connections. 

We apply our model to a protein fold recognition data set provided by Ding and Dubchak (2001). 
The proteins in this data set are obtained from the PDB_select database (Hobohm et al., 1992; 
Hobohm and Sander, 1994) such that two proteins have no more than 35% of the sequence identity 
for aligned subsequences larger than 80 residues. Originally, the resulting data set included 128 
unique folds. However, Ding and Dubchak (2001) selected only the 27 most populated folds (311 
proteins) for their analysis. They evaluated their models based on an independent sample (i.e., test 
set) obtained from PDB-40D (Lo Conte et al., 2000). PDB-40D contains the SCOP sequences with 
less than 40% identity with each other. Ding and Dubchak (2001) selected 383 representatives of 
the same 27 folds in the training set with no more than 35% identity to the training sequences. The 
training and test data sets are available online at http: //crd.lbl.gov/~cding/protein/. 

The covariates in these data sets are the length of the protein sequence, and the percentage 
composition of the 20 amino acids. While there might exist more informative covariates to pre- 
dict protein folds, we use these so that we can compare the results of our model to that of Ding 
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and Dubchak (2001), who trained several Support Vector Machines (SVM) with nonlinear kernel 
functions. 

We centered the covariates so they have mean zero, and used the following priors for the MNL 
model and qMNL model (with no interactions, only x; and x? as covariates): 


2, 
Ja 
E07), 
1), 
3,47). 


ajm ~ N(On 
logn) ~ N(0, 
Bil5,o. ~ N(0, 
(0, 

(— 


2 


log(&?) 
log(o7) 


Here, the hyperparameters for the variances of regression parameters are more elaborate than in the 
previous section. One hyperparameter, 6/, is used to control the variance of all coefficients, B; 
(where j = 1,...,/), for covariate x;. If a covariate is irrelevant, its hyperparameter will tend to be 
small, forcing the coefficients for that covariate to be near zero. This method, termed Automatic 
Relevance Determination (ARD), has previously been applied to neural network models by Neal 
(1996). We also used another hyperparameter, €, to control the overall magnitude of all B’s. This 
way, ©; controls the relevance of covariate x; compared to other covariates, and € controls the overall 
usefulness of all covariates in separating all classes. The standard deviation of B; is therefore equal 
to Eo). 

The above scheme was also used for the dpMNL model. Note that in this model, one ©; controls 
all Bjr, where j = 1,...,J indexes classes, and c = 1,...,C indexes the unique components in the 
mixture. Therefore, the standard deviation of B jle İS E0/V-. Here, ve is specific to each component 
c, and controls the overall effect of coefficients in that component. That is, while o and & are 
global hyperparameters common between all components, Ve is a local hyperparameter within a 
component. Similarly, the standard deviation of intercepts, Oj. in component c is NTe. We used 
N(0, 1) as the prior for ve and Te. 

We also needed to specify priors for u; and ©}, the mean and standard deviation of covariate xy, 
where / = 1,..., p. For these parameters, we used the following priors: 


2 


HiclHo1;01 ~ N(Ho1;99,); 
wor ~ N(0,5*), 
log(6g,) ~ N(0,2°), 
log(o7..)|Mo.1, Vo. a N (Mo, LV O, ae 

Mos, ~ N (0, 1°), 

log(V5,) ~ N(0,2°). 
These priors make use of higher level hyperparameters to provide flexibility. For example, if the 
components are not different with respect to covariate x;, the corresponding variance, 96, p becomes 

small, forcing uc close to their overall mean, uo.. 

The MCMC chains for MNL, qMNL, and dpMNL ran for 10000 iterations. Simulating the 
Markov chain took about 2.1 seconds per iteration (5.8 hours in total) for dp MNL and 0.5 seconds 
per iteration (1.4 hours in total) for MNL using a MATLAB implementation on an UltraSPARC III 
machine. Training the RBF-SVM model took about 2.5 minutes. 
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Model Accuracy (%) | Fi (%) 
MNL 50.0 41.2 
qMNL 50.5 42.1 
SVM (Ding and Dubchak, 2001) 49.4 - 
LSVM 50.5 47.3 
RBF-SVM 53.1 49.5 
dpMNL 58.6 53.0 











Table 3: Performance of models based on protein fold classification data. 


The results for MNL, qMNL, LSVM, RBF-SVM, and dpMNL are presented in Table 3, along 
with the results for the best SVM model developed by Ding and Dubchak (2001) on the exact same 
data set. As we can see, the nonlinear RBF-SVM model that we fit has better accuracy than the 
linear models. Our dpMNL model provides an additional improvement over the RBF-SVM model. 
This shows that there is in fact a nonlinear relationship between folding classes and the composition 
of amino acids, and our nonlinear model could successfully identify this relationship. 


4.2 Detecting Parkinson’s Disease 


The above example shows that our method can potentially improve prediction accuracy, though 
of course other classifiers, such as SVM and neural networks, may do better on some problems. 
However, we believe the application of our method is not limited to simply improving prediction 
accuracy—it can also be used to discover hidden structure in data by identifying subgroups (i.e., 
mixture components) in the population. This section provides an example to illustrate this concept. 

Neurological disorders such as Parkinson’s disease (PD) have profound consequences for pa- 
tients, their families, and society. Although there is no cure for PD at this time, it is possible to 
alleviate its symptoms significantly, especially at the early stages of the disease (Singh et al., 2007). 
Since approximately 90% of patients exhibit some form of vocal impairment (Ho et al., 1998), and 
research has shown that vocal impairment could be one of the earliest indicators of onset of the 
illness (Duffy, 2005), voice measurement has been proposed as a reliable tool to detect and monitor 
PD (Sapir et al., 2007; Rahn et al., 2007; Little et al., 2008). For example, patients with PD com- 
monly display a symptom known as dysphonia, which is an impairment in the normal production of 
vocal sounds. 

In a recent paper, Little et al. (2008) show that by detecting dysphonia, we could identify pa- 
tients with PD. Their study used data on sustained vowel phonations from 31 subjects, of whom 23 
were diagnosed with PD. The 22 covariates used include traditional variables, such as measures of 
vocal fundamental frequency and measures of variation in amplitude of signals, as well as a novel 
measurement referred to as pitch period entropy (PPE). See Little et al. (2008) for a detailed de- 
scription of these variables. This data set is publicly available at UCI Machine Learning Repository 
(http: //archive.ics.uci.edu/ml/datasets/Parkinsons). 

Little et al. (2008) use an SVM classifier with Gaussian radial basis kernel functions to identify 
patients with PD, chosing the SVM penalty value and kernel bandwidth by an exhaustive search over 
a range of values. They also perform an exhaustive search to select the optimal subset of features 
(10 features were selected). Their best model provides a 91.4% (+4.4) accuracy rate based on a 
bootstrap algorithm. This of course does not reflect the true prediction accuracy rate of the model 
for future observations since the model is trained and evaluated on the same sample. Here, we use 
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Model Accuracy (%) Fı (%) 

MNL 85.6 (2.2) 79.1 (2.8) 
qMNL 86.1 (1.5) 79.7 (2.1) 
LSVM 87.2 (2.3) 80.6 (2.8) 
RBF-SVM 87.2 (2.7) 79.9 (3.2) 
dpMNL 87.7 (3.3) 82.6 (2.5) 


Table 4: Performance of models based on detecting Parkinson’s disease. Standard errors of esti- 
mates (based on 5 cross-validation folds) are provided in parentheses. 























Group Frequency | Age Average | Male Proportion 
1 107 66 (0.7) 0.86 (0.03) 
2 12 72 (1.1) 0.83 (0.11) 
3 36 63 (1.8) 0.08 (0.04) 
4 40 65 (2.2) 0.40 (0.08) 
Population 195 66 (0.7) 0.60 (0.03) 














Table 5: The age average and male proportion for each cluster (i.e., mixture component) identified 
by our model. Standard errors of estimates are provided in parentheses. 


a 5-fold cross validation scheme instead in order to obtain a more accurate estimate of prediction 
accuracy rate and avoid inflating model performance due to overfitting. As a result, our models 
cannot be directly compared to that of Little et al. (2008). 

We apply our dpMNL model to the same data set, along with MNL and qMNL (no interactions, 
only x; and x? as covariates). Although the observations from the same subject are not independent, 
we assume they are, as done by Little et al. (2008). Instead of selecting an optimum subset of fea- 
tures, we used PCA and chose the first 10 principal components. The MCMC algorithm for MNL, 
qMNL, and dpMNL ran for 3000 iterations (the first 500 iterations were discarded). Simulating 
the Markov chain took about 0.7 second per iteration (35 minutes per data set) for dpMNL and 
0.1 second per iteration (5 minutes per data set) for MNL using a MATLAB implementation on an 
UltraSPARC HI machine. Training the RBF-SVM model took 38 seconds for each data set. 

Using the dpMNL model, the most probable number of components in the posterior is four 
(note that might change from one iteration to another). Table 4 shows the average and standard 
errors (based on 5-fold cross validation) of the accuracy rate and the F; measure for MNL, LSVM, 
RBF-SVM, and dpMNL. (But note that the standard errors assume independence of cross-validation 
folds, which is not really correct.) 

While dpMNL provides slightly better results, the improvement is not statistically significant. 
However, examining the clusters (i.e., mixture components) identified by dpMNL reveals some 
information about the underlying structure in the data. Table 5 shows the average age of subjects 
and male proportion for the four clusters (based on the most probable number of components in 
the posterior) identified by our model. Note that age and gender are not available from the UCI 
Machine Learning Repository, and they are not included in our model. They are, however, available 
from Table 1 in Little et al. (2008). The first two groups include substantially higher percentages 
of male subjects than female subjects. The average age in the second of these groups is higher 
compared to the first group. Most of the subjects in the third group are female (only 8% are male). 


1844 


NONLINEAR MODELS USING DIRICHLET PROCESS MIXTURES 


The fourth group also includes more female subjects than male subjects, but the disproportionality 
is not as high as for the third group. 


When identifying Parkinson’s disease by detecting dysphonia, it has been shown that gender 
has a confounding effect (Cnockaert et al., 2008). By grouping the data into clusters, our model 
has identified (to some extent) the heterogeneity of subjects due to age and gender, even though 
these covariates were not available to the model. Moreover, by fitting a separate linear model to 
each component (i.e., conditioning on mixture identifiers), our model approximates the confounding 
effect of age and gender. For this example, we could have simply taken the age and gender of 
subjects from Table 1 in Little et al. (2008) and included them in our model. In many situations, 
however, not all the relevant factors are measured. This could result in unobservable changes in the 
structure of data. We discuss this concept in more detail elsewhere (Shahbaba, 2009). 


5. Discussion and Future Directions 


We introduced a new nonlinear classification model, which uses Dirichlet process mixtures to model 
the joint distribution of the response variable, y, and the covariates, x, non-parametrically. We com- 
pared our model to several linear and nonlinear alternative methods using both simulated and real 
data. We found that when the relationship between y and x is nonlinear, our approach provides sub- 
stantial improvement over alternative methods. One advantage of this approach is that if the rela- 
tionship is in fact linear, the model can easily reduce to a linear model by using only one component 
in the mixture. This way, it avoids overfitting, which is a common challenge in many nonlinear 
models. 


We believe our model can provide more interpretable results. In many real problems, the iden- 
tified components may correspond to a meaningful segmentation of data. Since the relationship 
between y and x remains linear in each segment, the results of our model can be expressed as a set 
of linear patterns for different data segments. 


Hyperparameters such as A in RBF-SVM and yin dpMNL can substantially influence the per- 
formance of the models. Therefore, it is essential to choose these parameters appropriately. For 
RBF-SVM, we optimized À using a validation set that includes 20% of the training data. Figure 4 
(a) shows the effect of à on prediction accuracy for one data set. The value of à with the highest 
accuracy rate based on the validation set was used to train the RBF-SVM model. The hyperparam- 
eters in our dpMNL model are not fixed at some “optimum” values. Instead, we use hyperpriors 
that reflect our opinion regarding the possible values of these parameters before observing the data, 
with the posterior for these parameters reflecting both this prior opinion and the data. Hyperpriors 
for regression parameters, B, facilitate their shrinkage towards zero if they are not relevant to the 
classification task. The hyperprior for the scale parameter y affects how many mixture components 
are present in the data. Instead of setting y to some constant number, we allow the model to decide 
the appropriate value of y, using a hyperprior that encourages a small number of components, but 
which is not very restrictive, and hence allows y to become large in the posterior if required to fit 
the data. Choosing unreasonably restrictive priors could have a negative effect on model perfor- 
mance and MCMC convergence. Figure 4 (b) illustrates the negative effect of unreasonable priors 
for y. For this data set, the correct number of components is two. We gradually increase py, where 
log(y) ~ N(uy,2), in order to put higher probability on larger values of y and lower probability on 
smaller values. As we can see, setting uy > 4, which makes the hyperprior very restrictive, results in 
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Figure 4: Effects of scale parameters when fitting a data set generated according to Simulation 1. 
(a) The effect of A in the RBF-SVM model. (b) The effect of the prior on the scale 
parameter, y ~ log-N(y,27), as uy changes in the dpMNL model. 


a substantial decline in accuracy rate (solid line) due to overfitting with a large number of mixture 
components (dashed line). 

The computational cost for our model is substantially higher compared to other methods such 
as MNL and SVM. This could be a preventive factor in applying our model to some problems. 
The computational cost of our model could be reduced by using more efficient methods, such as 
the “split-merge” approach introduced by Jain and Neal (2007). This method uses a Metropolis- 
Hastings procedure that resamples clusters of observations simultaneously rather than incrementally 
assigning one observation at a time to mixture components. Alternatively, it might be possible to 
reduce the computational cost by using a variational inference algorithm similar to the one proposed 
by Blei and Jordan (2005). In this approach, the posterior distribution P is approximated by a 
tractable variational distribution Q, whose free variational parameters are adjusted until a reasonable 
approximation to P is achieved. 


We expect our model to outperform other nonlinear models such as neural networks and SVM 
(with nonlinear kernel functions) when the population is comprised of subgroups each with their 
own distinct pattern of relationship between covariates and response variable. We also believe that 
our model could perform well if the true function relating covariates to response variable contains 
sharp changes. 


The performance of our model could be negatively affected if the covariates are highly correlated 
with each other. In such situations, the assumption of diagonal covariance matrix for x adopted 
by our model could be very restrictive. To capture the interdependencies between covariates, our 
model would attempt to increase the number of mixture components (i.e., clusters). This however 
is not very efficient. To address this issue, we could use mixtures of factor analyzers, where the 
covariance structure of high dimensional data is model using a small number of latent variables (see 
for example, Rubin and Thayer, 2007; Ghahramani and Hinton, 1996). 


In this paper, we considered only continuous covariates. Our approach can be easily extended 
to situations where the covariate are categorical. For these problems, we need to replace the nor- 
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mal distribution in the baseline, Go, with a more appropriate distribution. For example, when the 
covariate x is binary, we can assume x ~ Bernoulli(u), and specify an appropriate prior distribution 
(e.g., Beta distribution) for u. Alternatively, we can use a continuous latent variable, z, such that 
u = exp(z)/{1+exp(z)}. This way, we can still model the distribution of z as a mixture of nor- 
mals. For categorical covariates, we can either use a Dirichlet prior for the probabilities of the K 
categories, or use K continuous latent variables, z;,...,zx, and let the probability of category j be 
exp(z;)/ £5 exp(z;). 

Throughout this paper, we assumed that the relationship between y and x is linear within each 
component of the mixture. It is possible of course to relax this assumption in order to obtain more 
flexibility. For example, we can include some nonlinear transformation of the original variables 
(e.g., quadratic terms) in the model. 


Our model can also be extended to problems where the response variable is not multinomial. 
For example, we can use this approach for regression problems with continuous response, y, which 
could be assumed normal within a component. We would model the mean of this normal distribution 
as a linear function of covariates for cases that belong to that component. Other types of response 
variables (i.e., with Poisson distribution) can be handled in a similar way. 


In the protein fold prediction problem discussed in this paper, classes were regarded as a set of 
unrelated entities. However, these classes are not completely unrelated, and can be grouped into 
four major structural classes known as a, B, «/B, and «+8. Ding and Dubchak (2001) show the 
corresponding hierarchical scheme (Table 1 in their paper). We have previously introduced a new 
approach for modeling hierarchical classes (Shahbaba and Neal, 2006, 2007). In this approach, we 
use a Bayesian form of the multinomial logit model, called corMNL, with a prior that introduces 
correlations between the parameters for classes that are nearby in the hierarchy. Our dpMNL model 
can be extend to classification problems where classes have a hierarchical structure (Shahbaba, 
2007). For this purposse, we use a corMNL model, instead of MNL, to capture the relationship 
between the covariates, x, and the response variable, y, within each component. The results is a 
nonlinear model which takes the hierarchical structure of classes into account. 


Finally, our approach provides a convenient framework for semi-supervised learning, in which 
both labeled and unlabeled data are used in the learning process. In our approach, unlabeled data 
can contribute to modeling the distribution of covariates, x, while only labeled data are used to 
identify the dependence between y and x. This is a quite useful approach for problems where the 
response variable is known for a limited number of cases, but a large amount of unlabeled data can 
be generated. One such problem is classification of web documents. 
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